Document Clustering at NTCIR-4 Workshop: Limiting Search Space of the K-Means Method Using Word Occurrence

Authors

  • Yuji Kaneda
  • Naonori Ueda
  • Kazumi Saito
Abstract

In this paper, we propose a new document clustering method based on the K-means method (kmeans). Our method restricts the representative vectors of kmeans to a finite set of candidate vectors. We also propose a way to construct these candidate vectors from groups of documents that contain the same word. We participated in the NTCIR-4 WEB Task D (Topic Classification Task) and experimentally compared our method with kmeans on this task.
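The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how a candidate vector is computed from the documents sharing a word, so the sketch assumes one candidate per word, taken as the mean term-count vector of the documents containing that word. It also assumes the update step picks the candidate nearest to the cluster mean. All function names (`vectorize`, `build_candidates`, `restricted_kmeans`) and the toy data are hypothetical.

```python
from collections import Counter


def vectorize(docs, vocab):
    """Turn each tokenized document into a term-count vector over vocab."""
    vecs = []
    for doc in docs:
        counts = Counter(doc)
        vecs.append([float(counts[w]) for w in vocab])
    return vecs


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sqdist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_candidates(docs, vecs, vocab):
    """ASSUMPTION: one candidate per word, the mean vector of the
    documents that contain that word."""
    candidates = []
    for w in vocab:
        members = [v for d, v in zip(docs, vecs) if w in d]
        if members:
            candidates.append(mean(members))
    return candidates


def restricted_kmeans(vecs, candidates, init, max_iter=20):
    """K-means whose representatives are always indices into candidates.

    init: list of candidate indices used as starting representatives.
    Returns (labels, reps), where reps are candidate indices.
    """
    reps = list(init)
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest current representative.
        labels = [
            min(range(len(reps)), key=lambda k: sqdist(v, candidates[reps[k]]))
            for v in vecs
        ]
        # Update step (ASSUMPTION): instead of moving to the cluster mean,
        # pick the candidate closest to that mean.
        new_reps = []
        for k in range(len(reps)):
            members = [v for v, lab in zip(vecs, labels) if lab == k]
            if not members:
                new_reps.append(reps[k])  # keep representative of an empty cluster
                continue
            center = mean(members)
            new_reps.append(
                min(range(len(candidates)), key=lambda c: sqdist(center, candidates[c]))
            )
        if new_reps == reps:
            break
        reps = new_reps
    return labels, reps


# Toy usage with four tiny tokenized documents (made-up data).
docs = [["apple", "fruit"], ["apple", "pie"], ["car", "engine"], ["car", "road"]]
vocab = sorted({w for d in docs for w in d})
vecs = vectorize(docs, vocab)
candidates = build_candidates(docs, vecs, vocab)
labels, reps = restricted_kmeans(vecs, candidates, init=[3, 2])
print(labels, [vocab[r] for r in reps])
```

Because the representatives can only take values from a finite set, the search space is bounded by the number of candidates rather than ranging over all real-valued vectors. Choosing the candidate nearest to the cluster mean is equivalent to choosing the candidate that minimizes the within-cluster sum of squared distances.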


Similar resources

Patent Map Generation Using Concept-Based Vector Space Model

This paper proposes a patent map generation system using concept-based vector space model and presents evaluation results from the NTCIR-4 patent feasibility study (FS) task. The concept-base is a knowledge base of words, which expresses each word as an associated vector. The word vectors are computed based on word co-occurrence in a target document set, therefore, the word vectors reflect targ...


Experiments on Patent Retrieval at NTCIR-4 Workshop

In the Patent Retrieval Task in NTCIR-4 Workshop, the search topic is the claim in a patent document, so we use the claim text and the IPC information for the similarity calculations between the search topic and each patent document in the collection. We examined the effectiveness of the similarity measure between IPCs and the term weighting for the occurrence positions of the keyword attribute...


Fuzzy Clustering Approach Using Data Fusion Theory and its Application To Automatic Isolated Word Recognition

 In this paper, utilization of clustering algorithms for data fusion in decision level is proposed. The results of automatic isolated word recognition, which are derived from speech spectrograph and Linear Predictive Coding (LPC) analysis, are combined with each other by using fuzzy clustering algorithms, especially fuzzy k-means and fuzzy vector quantization. Experimental results show that the...


Comparing k-means clusters on parallel Persian-English corpus

This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research has been done on clustering English documents. This raises the question of whether those results extend to other languages. Since the goal of document clustering is grouping of docum...


Chinese and Korean Topic Search of Japanese News Collections

UC Berkeley participated in the pivot bilingual task of the CLIR track at NTCIR Workshop 4. Our focus was on Chinese and Korean searches against the Japanese News document collection, using English as a pivot language. For comparison of our pivot techniques, we submitted Japanese monolingual and English Japanese bilingual search rankings as well. Two different commercial translation software pa...




Publication date: 2004